Who are we? We are the Team "RandomForest". The team consists of Simon Huber, Manuel Müller, Sevda Cürük and Solomun Araya.
It is said that the Team "RandomForest" is one to be aware of! ;)
What are we doing?
The setup as described in the task is the following: we are (pretending to be) data scientists recently hired by a company to build a machine learning concept for predicting car prices. The company buys cars to resell them, so it is highly useful to have a system which predicts car prices based on different features of each car. Nowadays, the company bases its prices on gut feeling, but with our system it will be able to predict prices based on historical data. We can also gain a lot of insight for our company into which attributes mainly influence the price. Without the possibilities of machine learning, we would continue to trust our gut feeling and compare our cars with similar cars on the internet. This could be a real-world task for a data scientist.
In reality, we are students at the Bern University of Applied Sciences, and this notebook is a task in our data science course. We're competing against our classmates in a Kaggle competition: the team which makes the best predictions on a test set wins.
So we are trying to solve the business case described above, and at the same time we certainly want to beat our classmates in the competition and build the best model.
Data
The data our model is based on was crawled from "Ebay Kleinanzeigen" (a German reselling website). In this dataset (our train data), every single car is described by the following attributes:
A second dataset called test was created just to evaluate the model's performance. It is actually a slice of the whole car dataset, but since the target value to predict is the feature price, this feature was removed before we started to work with the dataset. Our predictions can therefore be compared to the actual sales prices, and hence the model's performance can be measured.
Based on our data and our goal, we can assume that this will be a common supervised regression task.
Performance
The performance of the model is measured with the Root Mean Squared Error:
$$ \mbox{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^m \left(\hat{y}_i - y_i\right)^2}$$
How are we going to reach our goal?
Knowledge and a huge amount of trial and error. The aim is to move step by step towards the target of an acceptable RMSE score. To get there, we are going to use supervised machine learning algorithms. After preprocessing the datasets, we will use feature importance to evaluate the most important parameters for tuning hyperparameters. This should improve the score massively.
Why our model?
So far, the salespeople of the company had to define all prices from gut feeling. Although they have built up a huge amount of knowledge in reselling cars over the last years, prices were anything but optimized. An alternative would be to simply use Excel, which is not that useful since we are talking about a huge amount of data here. Therefore, a machine learning model is highly recommended.
Pre Assumptions
We assume that the dataset contains several features with a very low correlation to our predictive feature price, such as:
We assume (based on experience) that the kilometers and the age of the cars have a big influence on the reselling price.
We also gave some thought to the distribution, the description and the shape of the two datasets test and train. It would be very useful if they are similarly distributed.
We may be totally wrong with these assumptions. Hence, in the first chapter we will check the data to get an insight.
In the following paper, we are going to explain step by step how we did it.
# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
%%capture
# uncomment to update the library if working locally
!pip install dslectures --upgrade
import jedi
from dslectures.structured import proc_df
from dslectures.core import (
    get_dataset,
    display_large,
    convert_strings_to_categories,
    rf_feature_importance,
    plot_feature_importance,
    plot_dendogram,
)
from pathlib import Path
from tqdm import tqdm
import time
from numpy.testing import assert_array_equal
from scipy import stats
#data wrangling
import pandas as pd
import numpy as np
#data visualisation
import seaborn as sns
import matplotlib.pyplot as plt
#machine learning
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor as MLP
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
#uncomment the following if you haven't installed xgboost
#!pip install xgboost
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')
datapath = Path('data/')
Because this is a Kaggle competition, the datasets are available on: https://www.kaggle.com/c/bfh-ds-competition-2020/data.
As said before, the data is crawled from the German website "Ebay Kleinanzeigen".
#load train and test data from .csv
train = pd.read_csv(datapath/'train.csv')
test = pd.read_csv(datapath/'test.csv')
Create X and y
To have a consistent score which represents the score on the test set, we have to cut off a validation set before the preprocessing. The main idea is to preprocess the validation set the same way as the test set, so that the RMSE score shown in our notebook below will not be massively different from the RMSE score reached in the submission on Kaggle.
#create the feature matrix X and the target vector y
X = train.drop('price', axis=1)
y = train['price']
#split the train set into a test and a train set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.15, random_state=42)
print(f"{len(X_train)} train rows + {len(X_valid)} valid rows")
#reattach the target column to the train set
train = X_train
train['price'] = y_train
#merge the validation set and save it for later
valid = X_valid
valid['price'] = y_valid
valid.to_csv(datapath/'valid1.csv', index = False)
train.shape, valid.shape
As a first step, we gain a general overview and some insights into the data. For this, it is helpful to use commands like .shape, .info(), .head() and .tail().
train.shape, test.shape
We can see that the train set has 248923 rows and 21 columns, while the test set has 122541 rows and 20 columns.
The missing column is natural, because the test set does not contain the target feature price.
# ratio of trainset and testset
train_rat = str(round((100 / (train.shape[0]+valid.shape[0]+test.shape[0])) * train.shape[0],0))
valid_rat = str(round((100 / (train.shape[0]+valid.shape[0]+test.shape[0])) * valid.shape[0],0))
test_rat = str(round((100 / (train.shape[0]+valid.shape[0]+test.shape[0])) * test.shape[0],0))
print('Train Set: '+train_rat+'%')
print('Validation Set: '+valid_rat+'%')
print('Test Set: '+test_rat+'%')
Nice! The proportions of the datasets are already well distributed: the train set plus the validation set make up about 2/3, and the test set about 1/3 of all available data.
#information on numerical data
train.describe()
Important: our target variable is the numerical feature price. In the end we want to predict the price of the cars in the test set.
test.describe()
Note: Train and Test set have similar distributions.
Check Unique Values
As we now know, the number of rows in the train set is 211584 and in the test set 122541. Since carId is the key that uniquely identifies any car in the dataset, the number of unique carIds in each dataset should be equal to its number of rows. Let us check that.
train['carId'].nunique() == len(train), test['carId'].nunique() == len(test)
So basically, we just verified that every car in our datasets has a unique carId. So we do not have duplicates in our datasets, which is very good. Otherwise, we would have to get rid of them.
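If this check had failed, deduplication would be a one-liner. A hypothetical sketch (not needed here, since our data has no duplicates):
#hypothetical: keep only the first occurrence of each carId
#train = train.drop_duplicates(subset='carId', keep='first')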
The .dtypes attribute allows us to get an overview of the features and their datatypes, which will be important for later actions. To finally predict prices of the test set, it is necessary to have only numerical values.
train.dtypes, test.dtypes
There are a lot of object or categorical dtypes. We are going to make them numerical later.
First we want to explore the numerical columns. We assume that we have to make some adjustments to the numerical data before we can build our ml model.
Outliers always disturb statistical results. Therefore, we create the function remove_outliers. At an early stage, we applied this function with upper bound 0.98 and lower bound 0.02 to all features with outliers. But after several calculations of the RMSE, we realized that these bounds are not the best choice for every feature with outliers. Hence, we define the lower and upper bound individually per feature.
def remove_outliers(df, col, low = 0.02, up = 0.98):
"""
A function which detects the 0.02 / 0.98 (or any other custom) quantiles and removes
the rows which are outside these boundaries.
Args:
df: A pandas DataFrame
col: A column of a Pandas DataFrame as string (e.g. 'column').
low: lower boundary
up: upper boundary
Note: The function returns a new df!
--> train_without_outliers = remove_outliers(train, 'price', 0.05, 0.96)
"""
#find the quantiles
upper_boundary = df[col].quantile(up)
lower_boundary = df[col].quantile(low)
#return the df according to the boundaries
if (up == 1) and (low == 0):
df_return = df
elif (up != 1) and (low == 0):
df_return = df.loc[(df[col] < upper_boundary)]
elif (up == 1) and (low != 0):
df_return = df.loc[(df[col] > lower_boundary)]
else:
df_return = df.loc[(df[col] < upper_boundary) & (df[col] > lower_boundary)]
#print out the new minimum and maximum of the column
print(df_return[col].min(), df_return[col].max())
#return the df
return df_return
Price
Because the target vector is the most important numerical value, we will first focus on this one.
As can be seen, there are huge outliers in the price distribution. They would massively disturb the model's accuracy and the reliability of its predictions. Therefore, we are going to get rid of an appropriate amount of these outliers.
sns.boxplot(x = train['price'])
According to AutoBild (a German car magazine), there are no 'normal' (serial production) cars above 200'000 euros. So let's remove those first.
#remove unrealistic values
train = train.loc[train['price'] < 200000]
sns.boxplot(x = train['price'])
sns.distplot(train['price'])
This already looks better. But as mentioned in the Kaggle notebook https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition, many ML models struggle with heavily skewed, non-normal distributions. One way to reshape the distribution is to apply a log function to the data.
#Check the distribution in a logarithmic scale
sns.distplot(np.log1p(train["price"])).set_ylabel("distributionValue");
It seems to work. But when we tested it, changing price to a logarithmic scale did not affect the score, so we leave the price feature as it is for better understanding.
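For completeness, a minimal sketch of the log-target experiment we ran; some_model is a placeholder for any sklearn regressor:
#sketch of the discarded log-target variant
#some_model.fit(X_train, np.log1p(y_train))
#preds = np.expm1(some_model.predict(X_valid))
#the RMSE is then computed on preds vs. y_valid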
Age
We create a new feature called age to substitute the yearOfRegistration and dateCrawled features. It's more understandable and more consistent to new data.
The pandas function .to_datetime can be used so that pandas interprets the values as dates. Then we can calculate the age.
#convert 'dateCrawled' to the datetime feature
train['dateCrawled'] = pd.to_datetime(train['dateCrawled'])
#create the new 'age' feature by calculate the year of 'dateCrawled' minus 'yearOfRegistration'
train['age'] = train['dateCrawled'].dt.year - train['yearOfRegistration']
#print out the boxplot of age
sns.boxplot(x = train['age'])
As we can see, there are crazy numbers for age. This is because the yearOfRegistration feature is really messy. We want to remove these nonsense values!
Because "Back to the Future" is a sci-fi movie, the year of registration of a second-hand car can't be in the future. So we want to cut away these cars.
According to Wikipedia, the mass production of cars started in 1901. So we want to throw away at least all cars which are older than that.
#remove nonsense values
#throw away all cars with 'yearOfRegistration' before 1901
train = train.loc[train['yearOfRegistration'] > 1901]
#drop 'yearOfRegistration' because this feature is now represented by 'age'
train.drop(['yearOfRegistration'], axis=1, inplace = True)
#throw away all cars with 'yearOfRegistration' in the future
train = train.loc[train['age'] >= 0]
sns.boxplot(x = train['age'])
At first we removed outliers of the new feature age with our beautiful function. But after calculating the RMSE we realized that the score was getting worse, so we commented these lines out.
# remove the outliers in 'age'
#train = remove_outliers(train, 'age',0,1)
#sns.distplot(train['age'])
powerPS
sns.boxplot(x = train['powerPS'])
A quick Google search shows that in 2016 (the latest dateCrawled) the most powerful production car is the Bugatti Chiron/Sport/Divo with 1500 PS (horsepower). So there are a lot of misleading values we want to kick out.
Because every car above 1000 PS is a crazy sports car, we cut away all cars above 1000 PS (cut away outliers).
#remove nonsense values
train = train.loc[train['powerPS'] <= 1000]
plt.subplot(2,1,1)
sns.boxplot(x = train['powerPS'])
plt.subplot(2,1,2)
sns.distplot(train['powerPS']).set_ylabel("distributionValue")
kilometer
plt.subplot(2,1,1)
sns.boxplot(x = train['kilometer'])
plt.subplot(2,1,2)
sns.distplot(train['kilometer']).set_ylabel("distributionValue")
It seems we have a boundary for kilometer at 150'000. Also, the kilometers are grouped in categories, so we treat the feature like a category but leave the values numeric.
(((train.loc[train['kilometer'] == 150000].shape[0])/train.shape[0]))
That's a problem. Half of the training data has a kilometer value of 150000. Let's have a look how it behaves for the test set.
sns.distplot(test['kilometer'], axlabel="kilometer").set_ylabel("distributionValue"), test['kilometer'].max()
The distributions of the train and test set look similar. Still, the data looks more categorical than numeric. One idea is to create categories for the kilometer feature, so we did: we chose 12 categories to cluster the kilometer values of each car, but the RMSE got worse. Hence, we removed this step and left the kilometer values as they are in the initial dataset.
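For reference, a minimal sketch of the binning we tried before discarding it (pd.cut with 12 equal-width bins; the column name kilometer_cat is our own choice):
#sketch of the discarded binning approach
#train['kilometer_cat'] = pd.cut(train['kilometer'], bins=12, labels=False)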
name_len
In a Kaggle forum discussion about the dataset, someone had the idea to create a feature for the length of the name string. This way we can get some benefit out of the name feature.
Important
We now save the train set as trainset. The reason for this is that during the process of optimizing the score, we tried several possible approaches. The following steps, commands, and changes will not change the actual train set, because we will reload the set afterwards.
#save data as .csv file
train.to_csv(datapath/'trainset.csv', index = False)
Now the set is saved. When we reload the train set, the following changes will not be adopted, due to their bad influence on the model's performance.
#create the 'name_len' feature
train['name_len'] = train['name'].str.len()
#remove outliers
train = remove_outliers(train, 'name_len',0,0.99999)
sns.boxplot(x=train['name_len'])
postal code
The data seems to be from Germany (crawled from eBay Kleinanzeigen, which is a German website). Maybe we can get more information out of the postal code.
According to Wikipedia, German postal codes have 5 digits. Let's check on that.
train.loc[train['postalCode'] < 9999].shape[0], train.loc[train['postalCode'] > 9999].shape[0],train.loc[train['postalCode'] > 99999].shape[0]
10'000 instances (~5%) have a postal code which doesn't fit the German system. Let's check if there is a systematic error (e.g. all have the same number).
sns.distplot(train.loc[train['postalCode'] < 9999]['postalCode']).set_ylabel("distributionValue")
It seems not to be the case: the postal codes which don't fit the German system are randomly distributed. One explanation could be that these cars are sold on the website but the seller lives in another German-speaking country (Switzerland or Austria) where the postal code system has 4 digits. Another possibility are exceptions in the German system (see "Sonderregelungen" on Wikipedia): as described there, a few places have a four-digit postal code for various reasons. Unfortunately, we can't figure out which country they are really from, whether the unfitting cars are all based in locations which are an exception, or whether the data is just wrong.
Nevertheless, it looks like we could actually place most of the postal codes to coordinates. So let's get into it.
We load a dataset which has the geocordinates for each postal code (source: https://public.opendatasoft.com/explore/dataset/postleitzahlen-deutschland/export/).
#load the geodata to a new dataframe
geodata = pd.read_csv(datapath/'postleitzahlen_deutschland.csv')
#check the data
geodata.head(3)
#check if the the exceptions we mentioned above are in the dataset
geodata.loc[geodata['Postleitzahl'] < 9999].head(3)
As we can see, the dataset contains the exceptions. So let's just try to merge the postal codes onto our cars. If there is an error, it's not that big of a problem because all rows would contain the same error.
#rename the columns according to our train dataset
geodata.rename(columns={"Geo Point": "GeoPoint", "Postleitzahl": "postalCode"}, inplace=True)
#we don't need the name of the municipality, so let's drop it
geodata.drop('Ortsname',axis = 1, inplace = True)
#check if the changes are correct
geodata.head(1)
geodata.shape
Now we are ready to merge the two frames. We want to add the GeoPoint columns where ever the postalCode matches.
#before we merge we want to check if geodata only contains unique postal codes
geodata.nunique()
#to use postal code as an index we need to drop duplicates
geodata.drop_duplicates('postalCode', inplace = True)
#merge the frames (for 'GeoPoint' a new column is created)
train_merged = pd.merge(train, geodata, how="left", on="postalCode")
#check if it worked
train_merged.head(1)
Remember the 4-digit postal codes in the train set: we need to check if there are some unfitting cars (null value in GeoPoint).
#check for null values
train_merged['GeoPoint'].isnull().sum()
For 218 rows there was no matching postal code in the geodata frame. Let's just drop them.
#drop null values in 'GeoPoint'
train_merged.dropna(axis=0, subset = ['GeoPoint'], inplace = True)
train_merged['GeoPoint'].isnull().sum()
As we can see, we got rid of 218 cars in our train set by kicking out all cars without a GeoPoint.
To gain information we need to split up Latitude and Longitude.
#split Latitude and Longitude
train_merged['Latitude'] = train_merged.GeoPoint.str.split(',').str[0].astype(float)
train_merged['Longitude'] = train_merged.GeoPoint.str.split(',').str[1].astype(float)
#drop GeoPoint (the information is now in the new columns)
train_merged.drop('GeoPoint',axis = 1, inplace = True)
#check if it worked
train_merged.head(1)
Map
To gain more information we want to print out the cars and their price on a map of Germany.
For this we need the geopandas library.
The map to print on is a shapefile we downloaded from the same site as the table with the postal codes.
Copyright: The following concept of using geopandas to print out a price map and a lot of the code is adapted from this blog: https://medium.com/@ianforrest11/graphing-latitudes-and-longitudes-on-a-map-bf64d5fca391
#uncomment the following if you haven't installed geopandas and descartes
#!pip install geopandas
#!pip install descartes
# import libraries
import geopandas as gpd # could be possible that you first have to install geopandas as <pip install geopandas>
from shapely.geometry import Point, Polygon
import matplotlib.pyplot as plt
# import street map
map_germany = gpd.read_file(datapath/'Map/postleitzahlen_deutschland.shp')
#create a GeoDataFrame
gdf = gpd.GeoDataFrame(train_merged, geometry=gpd.points_from_xy(train_merged.Longitude, train_merged.Latitude))
gdf.head(1)
#most of the cars are below 20'000; to gain information we will plot only these (else the scale is too large)
gdf = gdf.loc[gdf['price'] < 20000]
#create figure and axes, assign to subplot
fig, ax = plt.subplots(figsize=(15,15))
#add .shp mapfile to axes
map_germany.plot(ax=ax, alpha=0.4,color='grey')
#add geodataframe to axes;
#assign 'price' variable to color the datapoints on the graph;
#make datapoints transparent using alpha;
#add legend;
#assign size of points using markersize
gdf.plot(column=gdf['price'],ax=ax, alpha=0.5, legend=True, markersize=10)
#add title to graph
plt.title('Price Map of Germany', fontsize=15,fontweight='bold')
#set latitude and longitude boundaries for map display
plt.xlim(5.5,15.5)
plt.ylim(45.5,55.5)
#show map
plt.show()
gdf.shape
We can clearly detect the German cities on the map (Berlin at the top right). We can also say that there are more cars in the south-west than in the east, and more expensive cars in the south (around Munich).
Sadly, it turned out that adding the latitude and longitude makes the score worse. Maybe this is because the postal code already divides the cars into German federal states, which might also influence the price (e.g. when people filter the website for cars in their own federal state). So we will continue with the train file without the merged latitude and longitude.
Reloading
As said before, we now reload the data, so that nothing from the steps above is included in the train set.
#load dataset
train = pd.read_csv(datapath/'trainset.csv')
train.groupby('seller')['carId'].nunique(), train.seller.unique()
The cars in our train set are, except for three cars, offered by private individuals. Therefore, the category seller would disturb our prediction. Before dropping it, we did a pre-test and indeed, the RMSE is a little bit better without the column seller. Therefore, we now drop this category.
train = train[train.seller != 'gewerblich']
train.drop(['seller'], axis=1, inplace = True)
We dropped the whole column seller, as we know that it is not representative and would disturb our predictions.
train.groupby('offerType')['carId'].nunique()
We want to make predictions for the price. Therefore, all offers of type 'Gesuch' (a buying request, not a sale offer) are not relevant for our model. Furthermore, there is only a tiny amount of 'Gesuch' rows in our datasets. Therefore, we first drop all cars with offerType = 'Gesuch'.
train = train[train.offerType != 'Gesuch']
Secondly, we get rid of the whole column offerType, because it is of little use for our purpose.
train.drop(['offerType'], axis=1, inplace = True)
The offerType column is now dropped from our train set.
Same problem different feature
We thought that this internal eBay feature has no correlation with the price and therefore no influence on the final score. Hence, we initially dropped it. But that was a mistake: the abtest feature has a positive impact on the score, therefore we leave it as it is.
There are two approaches for filling in missing values. We are using the .mode method. Initially, we filled missing values with the kNN (k-nearest-neighbour) method, but after checking the score we were disappointed: the RMSE increased considerably. Since the kNN method needs purely numerical values, you can see the procedure we went through later on.
#save the df with missing values for later
train_with_missing = train.copy()
train['vehicleType'].fillna(train['vehicleType'].mode()[0], inplace=True)
train['gearbox'].fillna(train['gearbox'].mode()[0], inplace=True)
train['model'].fillna(train['model'].mode()[0], inplace=True)
train['fuelType'].fillna(train['fuelType'].mode()[0], inplace=True)
train['notRepairedDamage'].fillna(train['notRepairedDamage'].mode()[0], inplace=True)
train.isnull().sum()
Note: when converting categorical to numeric, the null values change to 0, because in the conversion null turns into -1 and then we add 1 to the categorical code. Meaning: null --> -1 --> 0.
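A tiny demonstration of this mechanic on a toy Series (the values are made up):
#toy example: NaN gets category code -1, so code + 1 maps it to 0
toy = pd.Series(['benzin', None, 'diesel']).astype('category')
print(toy.cat.codes + 1)  # -> 1, 0, 2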
train.info()
Now we want to convert the object dtype columns into categorical and later in numeric.
def convert_strings_to_categories_train(df):
"""
A function to convert all columns of a dataframe with dtype 'object' into the 'category' dtype.
The function saves the category order of the train dataset in global variables,
which are needed later in the 'convert_strings_to_categories_test' function.
"""
for c in df:
if df[c].dtype == 'object':
df[c] = df[c].astype('category')
name_temp = str("cat_"+c)
globals()[name_temp] = df[c].cat.categories
print(name_temp)
convert_strings_to_categories_train(train)
def convert_categories_to_numeric(df):
"""
A function to convert all columns of a dataframe with dtype 'category'
into a numeric dtype (based on the category codes).
"""
for c in df:
if df[c].dtype.name == 'category':
df[c] = df[c].cat.codes + 1
convert_categories_to_numeric(train)
train.info()
As said above, we used the kNN method to fill in missing values in all datasets (train, valid, test). Since we know that this is worse than the .mode method, we do not adopt these changes. Therefore, we again save the train set and then reload it.
#save data as .csv file
train.to_csv(datapath/'trainset2.csv', index = False)
#overwrite the trainset to train with missing
train = train_with_missing.copy()
#make the columns numerical again
convert_strings_to_categories_train(train)
convert_categories_to_numeric(train)
Note: We are going to demonstrate the k-nearest-neighbour method, even though these changes are not adopted for our train set.
Steps:
The following code shows the steps above for every feature of the dataset that contains missing values.
Note: As explained before, missing values are at this stage 0 (because -1 as a category code and (-1)+1 = 0).
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
def impute_train(target_column):
"""
This function imputes missing values in the train set (only for categorical data) by using the knn method.
Args:
target_column: string of a column name in the train df (str)
"""
#split the df in two (one with all null values in target_column and one without)
train_no_missing = train.loc[train[target_column] != 0] # all rows with given value
train_missing = train.loc[train[target_column] == 0] # all rows with no given value
#create a list of columns for the feature matrix
X_columns = list(train.columns)
X_columns.remove(target_column)
#create feature matrix and target vector (keep them aligned after dropping NaNs)
X = train_no_missing[X_columns].dropna()
y = train_no_missing.loc[X.index, target_column]
#to build up the multi-dimensional space for kNN we have to scale the dimensions (columns)
#StandardScaler standardizes each column to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)
#reuse the scaler fitted on X (refitting it on the missing rows would scale them differently)
train_missing = scaler.transform(train_missing[X_columns])
#Create the knn model.
#Look at the five closest neighbors.
knn = KNeighborsRegressor(n_neighbors=5)
#Fit the model on the training data.
knn.fit(X,y)
#Make point predictions on the test set using the fitted model.
predictions = knn.predict(train_missing)
#replace the null values in the train set with the predictions
train.loc[train[target_column] == 0,target_column]= predictions.astype(int)
Now we can apply the function to all columns with null values (see above).
Nice to know: since we already used the .mode method, the function impute_train wouldn't work on the filled train set, because there are no missing values anymore.
#apply the function to the columns
impute_train('vehicleType')
impute_train('gearbox')
impute_train('model')
impute_train('fuelType')
impute_train('notRepairedDamage')
#check if all null values are filled
print(min(train['vehicleType']))
print(min(train['gearbox']))
print(min(train['model']))
print(min(train['fuelType']))
print(min(train['notRepairedDamage']))
As we said, we now reload the saved dataset so that the kNN imputations above do not affect our model's performance!
#load dataset
train = pd.read_csv(datapath/'trainset2.csv')
We now want to look at the correlations between the different attributes. Of most interest are, of course, the correlations with price, because in the end we are going to predict the price.
correlation_matrix = train.corr()
plt.subplots(figsize=(20,15))
sns.heatmap(abs(correlation_matrix), annot=True, fmt='.2f', linewidths=1);
The correlation matrix is massively useful. It shows the correlation between each pair of features within the dataset. We will drop the features which are really redundant (according to our logic and/or the correlation matrix) and then look at the matrix again.
Important: A correlation matrix doesn't show all dependencies between two features! To get a better insight, one approach is to plot the dependency and visually detect patterns.
#drop the columns we think are redundant
train.drop(['dateCrawled', 'carId', 'name', 'dateCreated', 'lastSeen', 'nrOfPictures', 'abtest'], axis=1, inplace=True)
#generate a correlation matrix
correlation_matrix = train.corr()
#set the plotsize
plt.subplots(figsize=(20,15))
#create the heatmap plot with the correlation matrix data
sns.heatmap(abs(correlation_matrix), annot=True, fmt='.2f', linewidths=1);
We can see that the features age, powerPS and kilometer have the strongest correlation with price. So let's have a deeper look.
#generate a pairplot for the most interesting (most correlated) features
attributes = ["price", "age", "powerPS", "kilometer"]
sns.pairplot(train[attributes].dropna());
As we can see, the relationships of price with age and powerPS look non-linear and non-random. Let's have a deeper look.
#have a look at the correlation of 'price' and 'age' (for better understanding we cut away the high prices)
sns.jointplot('age', 'price', data=train.loc[(train['price'] < 25000 ) & (train['age'] < 30)], kind='kde');
We can see that most of the cars are between 10 and 25 years old and cost less than 5000. There is a non-linear relationship running from top left to bottom right.
#have a look at the correlation of 'price' and 'powerPS'
sns.jointplot('powerPS', 'price', data=train.loc[(train['price'] < 15000 ) & (train['powerPS'] < 250)], kind = 'kde')
For 'powerPS' we can see a slight linear correlation from bottom-left to top-right.
train.info()
As we can see, our train data now not only has no missing values, all features are numerical as well. So we save this train set as train_final.csv so that we can easily use it in later steps.
#save processed data as .csv file
train.to_csv(datapath/'train_final.csv', index = False)
As mentioned right at the beginning, we split the train set into trainset and validset so that we can now treat the validset as the test set.
Since all sets must have the same features for evaluation, we now apply the newly created or renamed features from above to the validset. Therefore, we omit detailed explanations of each step.
valid = pd.read_csv(datapath/'valid1.csv')
valid.info()
valid = valid.loc[valid['price'] < 200000]
# create the age feature
valid['dateCrawled'] = pd.to_datetime(valid['dateCrawled'])
valid['age'] = valid['dateCrawled'].dt.year - valid['yearOfRegistration']
valid = valid.loc[valid['yearOfRegistration'] > 1901]
valid.drop(['yearOfRegistration'], axis=1, inplace = True)
#deal with unrealistic value
valid.loc[valid['age'] < 0, "age"]= valid['age'].median()
valid.loc[valid['age'] > 106, "age"]= valid['age'].median()
valid.loc[valid['powerPS'] > 1500, "powerPS"] = valid['powerPS'].median()
#drop the same columns we dropped in the train set
valid.drop(['seller'], axis=1, inplace = True)
valid.drop(['carId'], axis=1, inplace = True)
valid.drop(['offerType'], axis=1, inplace = True)
valid.drop(['dateCrawled'], axis=1, inplace = True)
valid.drop(['name'], axis=1, inplace = True)
valid.drop(['dateCreated'], axis=1, inplace = True)
valid.drop(['lastSeen'], axis=1, inplace = True)
valid.drop(['nrOfPictures'], axis=1, inplace = True)
valid.drop(['abtest'], axis=1, inplace = True)
#fill missing values with the same method (mode) we used in the train set
valid['vehicleType'].fillna(valid['vehicleType'].mode()[0], inplace=True)
valid['gearbox'].fillna(valid['gearbox'].mode()[0], inplace=True)
valid['model'].fillna(valid['model'].mode()[0], inplace=True)
valid['fuelType'].fillna(valid['fuelType'].mode()[0], inplace=True)
valid['notRepairedDamage'].fillna(valid['notRepairedDamage'].mode()[0], inplace=True)
valid.isnull().sum()
We got rid of the null values. Now the validset is ready to be prepared so that all feature types are numerical.
def convert_strings_to_categories_test(df):
"""
A function to convert all columns of a dataframe with dtype 'object' into the 'category' dtype.
The function uses the category orders of the train dataset
which were saved by the 'convert_strings_to_categories_train' function.
"""
for c in df:
if df[c].dtype == 'object':
df[c] = df[c].astype('category')
name_temp = str("cat_"+c)
df[c] = df[c].cat.set_categories(globals()[name_temp], ordered=True)
print(name_temp)
convert_strings_to_categories_test(valid)
convert_categories_to_numeric(valid)
As we can see, our validset now not only has no missing values, all features are numerical as well. So we save this validation set as valid_final.csv so that we can easily use it in later steps.
#save processed data as .csv file
valid.to_csv(datapath/'valid_final.csv', index = False)
One hot encoding can improve the score by removing the artificial ordering that numeric category codes impose on decision trees. Take our fuelType column: we have 7 types of fuel, represented in our case by the values 1-7. The problem is that the decision trees can get confused, because the order of these values has no meaning at all.
This is unlike "real" numeric values, where proximity has a meaning: the price of two cars with 700 / 710 PS is more similar than that of two cars with 30 / 700 PS.
Long story short: we want to apply one hot encoding to important columns (gearbox, fuelType, notRepairedDamage).
Note: We won't apply it to all columns because more columns need more computation power.
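To illustrate what one hot encoding does, a toy example (the values are made up):
#toy example: each fuel type becomes its own 0/1 indicator column
demo = pd.DataFrame({'fuelType': ['benzin', 'diesel', 'benzin']})
print(pd.get_dummies(demo, columns=['fuelType']))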
#load train processed data
train_processed = pd.read_csv(datapath/'train_final.csv')
#load valid processed data
valid_processed = pd.read_csv(datapath/'valid_final.csv')
We first have to change these three features gearbox, fuelType and notRepairedDamage back to categorical values for the one hot encoding.
#create dictionaries out of categorical mappings
dict_gearbox = dict(enumerate(cat_gearbox))
dict_fuelType = dict(enumerate(cat_fuelType))
dict_notRepairedDamage = dict(enumerate(cat_notRepairedDamage))
#replace the numbers with strings
train_processed["gearbox"] = (train_processed["gearbox"]-1).replace(dict_gearbox)
train_processed["fuelType"] = (train_processed["fuelType"]-1).replace(dict_fuelType)
train_processed["notRepairedDamage"] = (train_processed["notRepairedDamage"]-1).replace(dict_notRepairedDamage)
valid_processed["gearbox"] = (valid_processed["gearbox"]-1).replace(dict_gearbox)
valid_processed["fuelType"] = (valid_processed["fuelType"]-1).replace(dict_fuelType)
valid_processed["notRepairedDamage"] = (valid_processed["notRepairedDamage"]-1).replace(dict_notRepairedDamage)
#convert the columns to categories
convert_strings_to_categories_test(train_processed)
convert_strings_to_categories_test(valid_processed)
#apply one hot encoding to the corresponding columns
train_processed = pd.get_dummies(train_processed, columns=['gearbox','fuelType','notRepairedDamage'])
valid_processed = pd.get_dummies(valid_processed, columns=['gearbox','fuelType','notRepairedDamage'])
As we have learned in the process, we need to create the feature matrix X and the target vector y again. After all this preprocessing, the result is much better now.
#create the feature matrix X and the target vector y
X_train = train_processed.drop('price', axis=1)
y_train = train_processed['price']
#create the feature matrix X and the target vector y for the validation set
X_valid = valid_processed.drop('price', axis=1)
y_valid = valid_processed['price']
As mentioned in the description of the competition the score is evaluated on the Root-Mean-Squared-Error (RMSE).
$$ \mbox{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^m \left(\hat{y}_i - y_i\right)^2}$$
The RMSE describes the standard deviation of the prediction errors. It puts a higher weight on larger errors, which means the metric is more sensitive to outliers.
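A quick numerical illustration of this sensitivity (the numbers are made up): a single large error dominates the RMSE.
#toy example: one outlier error of 100 among small errors of 1
errors = np.array([1, 1, 1, 100])
print(np.sqrt(np.mean(errors**2)))  # RMSE ~50.0
print(np.mean(np.abs(errors)))      # MAE ~25.8, far less affected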
#function is copied/adapted from Lewis and Leandro's Data Science Lectures
def rmse(y, yhat):
"""A utility function to calculate the Root Mean Square Error (RMSE).
Args:
y (array): Actual values for target.
yhat (array): Predicted values for target.
Returns:
rmse (double): The RMSE.
"""
return np.sqrt(mean_squared_error(y, yhat))
#function is copied/adapted from Lewis and Leandro's Data Science Lectures
def print_rf_scores(fitted_model):
"""Generates RMSE and R^2 scores from fitted Random Forest model."""
yhat_train = fitted_model.predict(X_train)
R2_train = fitted_model.score(X_train, y_train)
yhat_valid = fitted_model.predict(X_valid)
R2_valid = fitted_model.score(X_valid, y_valid)
scores = {
"RMSE on train:": rmse(y_train, yhat_train),
"R^2 on train:": R2_train,
"RMSE on valid:": rmse(y_valid, yhat_valid),
"R^2 on valid:": R2_valid,
}
if hasattr(fitted_model, "oob_score_"):
scores["OOB R^2:"] = fitted_model.oob_score_
for score_name, score_value in scores.items():
print(score_name, round(score_value, 3))
Our first RandomForestRegressor model is now finally created. It is not optimized, since no hyperparameter tuning has been done yet; we just show our first RMSE score.
model = RandomForestRegressor(n_estimators=30, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print_rf_scores(model)
We achieved an RMSE of 4168 for our simple RandomForest model. That's not bad (considering the price range of 200'000 in the train set)!
Let's see how far we can go when we choose other models and tune them.
The comparison plot below speaks for itself...
To pick the best model, we compare different regression models (adapted from https://www.kaggle.com/veensk/titanic-multiple-classifier-analysis). We've just picked a few regressors we stumbled across.
#create a list with different model types
models = [
RandomForestRegressor(),
GradientBoostingRegressor(),
AdaBoostRegressor(),
HistGradientBoostingRegressor(),
xgb.XGBRegressor()
]
log_cols = ["Model", "RMSE"]
log = pd.DataFrame(columns=log_cols)
#let every model predict on the validation set and save the RMSE scores in a dictionary
for m in models:
name = m.__class__.__name__
m.fit(X_train, y_train)
train_predictions = m.predict(X_valid)
RMSE = rmse(y_valid, train_predictions)
log = pd.concat([log, pd.DataFrame([{'Model': name, 'RMSE': RMSE}])], ignore_index=True)
#plot out the RMSE scores to compare them
plt.title('Model Performance')
plt.xlabel('RMSE')
sns.set_color_codes("pastel")
sns.barplot(x="RMSE", y="Model", data=log)
The XGBRegressor scores the best without optimization. Let's see how far we can go when we tune each of these models. Because we are most familiar with the random forest model we will start with that.
As written before, we want to tune the hyperparameters of the RandomForest model.
To do so we want to go through each parameter and search for the best possible value. The order of which parameter to tune first matters.
number of trees
At first we want to find out when the score of a RandomForest model stops improving significantly by just adding more trees. This helps in the further process, because fitting a model with 30 trees is much faster than fitting one with 100 trees. Given the many iterations in the cells below, this can save us hours. ;)
Copyright: The following concept of analyzing each tree is adapted from Lewis and Leandro's data science lectures.
#initialize a rf with 100 trees so we can analyze the impact of each additional tree from 1 - 100
model = RandomForestRegressor(n_estimators = 100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print_rf_scores(model)
#let every tree predict on the validation set
preds = np.stack([tree.predict(X_valid) for tree in model.estimators_])
#initialize a list for the predictions
preds_list = []
#append each tree's predictions one by one
for tree in model.estimators_:
preds_list.append(tree.predict(X_valid))
# concatenate list of predictions into single array
preds_v2 = np.stack(preds_list)
# test that arrays are equal
assert_array_equal(preds, preds_v2)
#generate a plot for the r2 score when we add tree after tree
#this function is adapted from Lewis and Leandro's ds lectures
def plot_r2_vs_trees(preds, y_valid):
"""Generate a plot of R^2 score on validation set vs number of trees in Random Forest"""
fig, ax = plt.subplots()
plt.plot(
[
r2_score(y_valid, np.mean(preds[: i + 1], axis=0))
for i in range(len(preds))
]
)
ax.set_ylabel("$R^2$ on validation set")
ax.set_xlabel("Number of trees")
plot_r2_vs_trees(preds, y_valid)
Note: Above 40 trees the R2 value gets only slightly better. So we will work with just 40 estimators for the hyperparameter tuning.
Grid Search
from sklearn.model_selection import GridSearchCV
# define range of values for each hyperparameter
param_grid = [
{
"max_features": [0.05, 0.2, 0.5, 1.0, "sqrt", "log2"],
"min_samples_leaf": [0.1,0.5,1, 3, 5, 10],
}
]
# instantiate baseline model
model = RandomForestRegressor(n_estimators=60, n_jobs=-1, random_state=42)
# initialise grid search with cross-validation
grid_search = GridSearchCV(
model, param_grid=param_grid, cv=3, scoring="neg_root_mean_squared_error", n_jobs=-1)
!!! Attention: Uncomment the next cell to perform the grid search, but this will take up to several hours of computation time. !!!
#apply the grid search for the initialized rf model
#%time grid_search.fit(X_train, y_train)
#print(grid_search.best_params_)
#print(grid_search.best_estimator_)
#print_rf_scores(best_model)
Copyright: The following concept for analysing hyperparameters is copied from Lewis and Leandro's data science lectures.
rmse_train = []
rmse_valid = []
max_features = [0.05, 0.2, 0.5, 1.0, "sqrt", "log2"]
for f in max_features:
rf = RandomForestRegressor(n_estimators=40, max_features=f, n_jobs=-1, random_state = 42)
rf.fit(X_train, y_train)
yhat_train = rf.predict(X_train)
yhat_valid = rf.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
f_best = max_features[np.argmin(rmse_valid)]
f_best
rmse_train = []
rmse_valid = []
max_depths = [1, 2, 16, 32, 64, 100, 150]
for d in max_depths:
rf = RandomForestRegressor(n_estimators=40, max_depth=d, max_features=f_best, n_jobs=-1, random_state = 42)
rf.fit(X_train, y_train)
yhat_train = rf.predict(X_train)
yhat_valid = rf.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
d_best = max_depths[np.argmin(rmse_valid)]
d_best
rmse_train = []
rmse_valid = []
min_samples_leaf = [0.1,0.5,1, 3, 5, 10, 20, 100, 300]
for l in min_samples_leaf:
rf = RandomForestRegressor(n_estimators=40, max_depth=d_best, max_features=f_best, \
min_samples_leaf=l, n_jobs=-1, random_state = 42)
rf.fit(X_train, y_train)
yhat_train = rf.predict(X_train)
yhat_valid = rf.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
l_best = min_samples_leaf[np.argmin(rmse_valid)]
l_best
rmse_train = []
rmse_valid = []
min_samples_split = [2, 3, 4, 10]
for s in min_samples_split:
rf = RandomForestRegressor(n_estimators=40, max_depth=d_best, max_features=f_best, \
min_samples_leaf=l_best, min_samples_split= s, n_jobs=-1, random_state = 42)
rf.fit(X_train, y_train)
yhat_train = rf.predict(X_train)
yhat_valid = rf.predict(X_valid)
# we average the scores and append them to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
s_best = min_samples_split[np.argmin(rmse_valid)]
s_best
rmse_train = []
rmse_valid = []
min_impurity_split = [0.5,1,2, 3, 4, 10]
for i in min_impurity_split:
rf = RandomForestRegressor(n_estimators=40, max_depth=d_best, max_features=f_best, \
min_samples_leaf=l_best, min_samples_split=s_best, n_jobs=-1, random_state = 42,\
min_impurity_split=i)
rf.fit(X_train, y_train)
yhat_train = rf.predict(X_train)
yhat_valid = rf.predict(X_valid)
# we average the scores and append them to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
i_best = min_impurity_split[np.argmin(rmse_valid)]
i_best
#let the best rf model predict on the validation set
rf = RandomForestRegressor(n_estimators=40, max_depth=d_best, max_features=f_best, \
min_samples_leaf=l_best, min_samples_split=s_best, n_jobs=-1, random_state = 42,\
min_impurity_split=i_best)
rf.fit(X_train, y_train)
print_rf_scores(rf)
We have improved the RMSE on the validation set by 200. Let's see how the model improves when we take more estimators.
rf = RandomForestRegressor(n_estimators=300, max_depth=d_best, max_features=f_best, \
min_samples_leaf=l_best, min_samples_split=s_best, n_jobs=-1, random_state = 42,\
min_impurity_split=i_best)
rf.fit(X_train, y_train)
print_rf_scores(rf)
This gives us another 40. Let's stay with that and check out what's possible with other regression models.
But first we want to save the hyperparameters so we don't have to run the whole thing again.
#print out the best hyperparameters so we can copy this model when we need it.
rf
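An alternative to copy-pasting the printed parameters is to persist the fitted model itself, for example with joblib (a sketch; the filename is our own choice):
#sketch: persist the tuned model instead of copying its parameters
#import joblib
#joblib.dump(rf, datapath/'rf_tuned.joblib')
#rf = joblib.load(datapath/'rf_tuned.joblib')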
#re-instantiate the tuned model with the saved hyperparameters
rf = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=16, max_features=0.5, max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=10, min_samples_leaf=1,
min_samples_split=10, min_weight_fraction_leaf=0.0,
n_estimators=300, n_jobs=-1, oob_score=False,
random_state=42, verbose=0, warm_start=False)
rf.fit(X_train, y_train)
predictionsrf = rf.predict(X_valid)
# create a comparison DataFrame
compare_price_rf = pd.DataFrame(columns=['Price', 'Predictions', 'Difference'])
#fill the columns
compare_price_rf['Price'] = y_valid
compare_price_rf['Predictions'] = predictionsrf
compare_price_rf['Difference'] = compare_price_rf['Price'] - compare_price_rf['Predictions']
#check if all worked fine
compare_price_rf.head()
#print out a scatterplot with the real price and the difference to our predictions
sns.jointplot("Price", "Difference", data=compare_price_rf,
kind="reg", truncate=False,
color="b", height=7)
We can see that our RandomForest model predicts poorly on higher-priced cars: the difference increases linearly with the price.
We think this is because most of the training data lies in the lower price region, so the model can't train well on the higher-priced cars.
compare_price_rf["Difference"].describe()
As we can see, over more than 37'000 predictions the RandomForestRegressor is overall predicting a little too high. Therefore, the mean as well as the median of the difference is negative.
Another approach to predict values, instead of the RandomForestRegressor, is the GradientBoostingRegressor. The procedure for defining hyperparameters is the same as with the RandomForestRegressor; however, the parameters can be different.
from sklearn import ensemble
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
# define range of values for each hyperparameter
param_grid = [
{
"learning_rate": [0.05,0.1,0.15,0.5],
"n_estimators": [16,32,64,100,300],
"max_features": [0.05, 0.2, 0.5, 1.0, "sqrt", "log2"],
"min_samples_leaf": [0.1,0.5,1, 3, 5, 10],
"max_depth": [1,2,3,4,5]
}
]
# instantiate baseline model
gbr_model = GradientBoostingRegressor(n_estimators=25, random_state=42)
# initialise grid search with cross-validation
grid_search = GridSearchCV(
gbr_model, param_grid=param_grid, cv=3, scoring="neg_root_mean_squared_error", n_jobs=-1)
The next few cells are part of the grid search. It optimizes all given parameters against each other so that the regressor makes the most powerful predictions.
Attention!! Running the grid search command can take hours.
#%time grid_search.fit(X_train, y_train)
#print(grid_search.best_params_)
#best_params = grid_search.best_params_
#best_params
#best_model = grid_search.best_estimator_
#best_model
#print_rf_scores(best_model)
Another approach to get the best value for each parameter is the following procedure. It allows us to optimize each parameter step by step. From several tests we found out that some parameters are fitted best with their default value, such as max_leaf_nodes. Therefore, we optimize the following parameters:
As we've learned, tune just one parameter at a time and check if the change positively influences the RMSE.
Tip: Letting the functions evaluate the best values takes a lot of time. Therefore, once we were happy with these parameters, we printed out the whole GradientBoostingRegressor so that the parameters are fixed (see below).
At first we want to check our RMSE without hyperparameter tuning.
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
print_rf_scores(gbr)
Copyright: The following concept for analysing hyperparameters is copied from Lewis and Leandro's data science lectures.
rmse_train = []
rmse_valid = []
max_depth= [25,30,35]
for mg in max_depth:
gbr = GradientBoostingRegressor(max_depth=mg, max_features=1, learning_rate=0.15, \
min_samples_leaf=3, random_state = 42)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
mg_best = max_depth[np.argmin(rmse_valid)]
mg_best
rmse_train = []
rmse_valid = []
learning_rate = [0.001,0.01,0.05]
for lg in learning_rate:
gbr = GradientBoostingRegressor(max_depth=mg_best, max_features=1, learning_rate=lg, \
min_samples_leaf=3, random_state = 42)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
lg_best = learning_rate[np.argmin(rmse_valid)]
lg_best
rmse_train = []
rmse_valid = []
n_estimators=[150,170,200]
for ng in n_estimators:
gbr = GradientBoostingRegressor(n_estimators=ng, max_depth=mg_best, max_features=1, learning_rate=lg_best, \
min_samples_leaf=3, random_state = 42)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
ng_best = n_estimators[np.argmin(rmse_valid)]
ng_best
rmse_train = []
rmse_valid = []
max_features = [0.05, 0.2, 0.5, 1.0, "sqrt", "log2"]
for fg in max_features:
gbr = GradientBoostingRegressor(n_estimators=ng_best, max_depth=mg_best, max_features=fg, learning_rate=lg_best, \
min_samples_leaf=3, random_state = 42)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
fg_best = max_features[np.argmin(rmse_valid)]
fg_best
rmse_train = []
rmse_valid = []
min_samples_split = [1.0,2,3]
for sg in min_samples_split:
gbr = GradientBoostingRegressor(n_estimators=ng_best, max_depth=mg_best, max_features=fg_best, learning_rate=lg_best, \
min_samples_leaf=3, random_state = 42, min_samples_split=sg)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
sg_best = min_samples_split[np.argmin(rmse_valid)]
sg_best
rmse_train = []
rmse_valid = []
min_samples_leaf = [5,6,8]
for ming in min_samples_leaf:
gbr = GradientBoostingRegressor(n_estimators=ng_best, max_depth=mg_best, max_features=fg_best, learning_rate=lg_best, \
min_samples_leaf=ming, random_state = 42, min_samples_split=sg_best)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
ming_best = min_samples_leaf[np.argmin(rmse_valid)]
ming_best
rmse_train = []
rmse_valid = []
min_impurity_split = [0,1]
for ims in min_impurity_split:
gbr = GradientBoostingRegressor(n_estimators=ng_best, max_depth=mg_best, max_features=fg_best, learning_rate=lg_best, \
min_samples_leaf=ming_best, random_state = 42, min_samples_split=sg_best,\
min_impurity_split=ims)
gbr.fit(X_train, y_train)
yhat_train = gbr.predict(X_train)
yhat_valid = gbr.predict(X_valid)
# we append the scores to the list
rmse_train.append(rmse(y_train, yhat_train))
rmse_valid.append(rmse(y_valid, yhat_valid))
ims_best = min_impurity_split[np.argmin(rmse_valid)]
ims_best
#initialize the gbr model
gbr = GradientBoostingRegressor(n_estimators=ng_best, max_depth=mg_best, max_features=fg_best, learning_rate=lg_best, \
min_samples_leaf=ming_best, random_state = 42, min_samples_split=sg_best,\
min_impurity_split=ims_best)
gbr.fit(X_train, y_train)
We achieved a massive improvement with hyperparameter tuning (around 780), which is very cool!! We will see later that the GBR is the regressor with the lowest RMSE score. Therefore, it is expected that in the following stacked/blended model the weight of the GBR will be higher than that of the other regressors.
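To preview the idea, a minimal sketch of such a weighted blend (the weights are placeholders to be tuned on the validation set):
#sketch of a weighted blend of two fitted regressors (weights are placeholders)
#blend = 0.6 * gbr.predict(X_valid) + 0.4 * rf.predict(X_valid)
#print(rmse(y_valid, blend))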
#save the parameters for later
gbr
gbr = GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
init=None, learning_rate=0.05, loss='ls',
max_depth=30, max_features='sqrt',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=1, min_samples_leaf=6,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=200, n_iter_no_change=None,
presort='deprecated', random_state=42, subsample=1.0,
tol=0.0001, validation_fraction=0.1, verbose=0,
warm_start=False)
gbr.fit(X_train, y_train)
predictionsgbr = gbr.predict(X_valid)
# create a comparison DataFrame
compare_price_gbr = pd.DataFrame(columns=['Price', 'Predictions', 'Difference'])
#fill the columns
compare_price_gbr['Price'] = y_valid
compare_price_gbr['Predictions'] = predictionsgbr
compare_price_gbr['Difference'] = compare_price_gbr['Price'] - compare_price_gbr['Predictions']
#check if all worked fine
compare_price_gbr.head()
#print out a scatterplot with the real price and the difference to our predictions
sns.jointplot("Price", "Difference", data=compare_price_gbr,
kind="reg", truncate=False,
color="b", height=7)
compare_price_gbr["Difference"].describe()
The price difference of the GradientBoostingRegressor is less sensitive to higher-priced cars in the validset. But the effect is still the same as with the RandomForestRegressor, and the GBR also predicts on average a higher price than the 'real' price.
According to scikit-learn, impurity-based feature importance can be very misleading. Therefore, as they recommend, we also use permutation_importance besides feature_importance. We use the validset, since we want to know the model's feature importance on the test set. Hence, we are happy that we already created such a validset, which is more similar to the test set than the train set.
Copyright: The following code is from: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
from sklearn import datasets, ensemble
from sklearn.inspection import permutation_importance
feature_importance = gbr.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
fig = plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(valid_processed.columns)[sorted_idx])
plt.title('Feature Importance (MDI)')
plt.ylabel("Features of the Validset")
result = permutation_importance(gbr, X_valid, y_valid, n_repeats=10,
random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
plt.subplot(1, 2, 2)
plt.boxplot(result.importances[sorted_idx].T,
vert=False, labels=np.array(valid_processed.columns)[sorted_idx])
plt.title("Permutation Importance (MDA)")
plt.ylabel("Features of the Validset")
fig.tight_layout()
plt.show()
MDI means Mean Decrease in Impurity; MDA means Mean Decrease in Accuracy. As can be seen in the figures above, both methods point to largely the same five most influential features.
XGBoost is an advanced method of gradient boosting. It runs faster and uses more accurate methods to estimate the best tree model.
#we need to transform the data to a special format (DMatrix) so the xgb regressor can handle the data
D_train = xgb.DMatrix(X_train, label=y_train)
D_valid = xgb.DMatrix(X_valid, label=y_valid)
#set the parameters for the regressor (since we want a simple model first, we start with an empty dictionary)
param = {}
#set the number of boosting iterations
steps = 20
#because XGB works differently, we need to adapt our scoring function
def print_XGB_scores(fitted_model):
    """Generates RMSE scores from a fitted XGB model."""
    yhat_train = fitted_model.predict(D_train)
    yhat_valid = fitted_model.predict(D_valid)
    scores = {
        "RMSE on train:": rmse(y_train, yhat_train),
        "RMSE on valid:": rmse(y_valid, yhat_valid)
    }
    for score_name, score_value in scores.items():
        print(score_name, round(score_value, 3))
model = xgb.train(param, D_train, steps)
print_XGB_scores(model)
We can see that the score isn't that bad (compared to other regressors) for no hyperparameter tuning. Now let's get into it and tune these hyperparameters...
Note: Don't be confused. We found out that XGB actually has a built-in sklearn API. Because it's easier to handle, we use the sklearn interface for the following hyperparameter tuning. But we leave the above approach in to make clear that XGB is not a native sklearn regressor.
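As a small side-by-side sketch of the two equivalent interfaces (our own example, using the variables from this notebook; the two APIs have slightly different default parameters, so the scores will not match exactly):
#native API: needs a DMatrix and xgb.train
D_train_demo = xgb.DMatrix(X_train, label=y_train)
booster = xgb.train({}, D_train_demo, num_boost_round=20)
#sklearn-style wrapper: accepts DataFrames and fits like any sklearn regressor
sk_model = xgb.XGBRegressor(n_estimators=20, random_state=42)
sk_model.fit(X_train, y_train)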
Copyright: The following concept for analysing hyperparameters is copied from Lewis and Leandro's data science lectures.
n_estimators
Represents the number of gradient boosted trees.
rmse_train = []
rmse_valid = []
n_estimators = [20, 22, 24, 25, 26, 28]
for n in n_estimators:
    xgb_model = xgb.XGBRegressor(n_estimators=n, n_jobs=-1, random_state=42)
    xgb_model.fit(X_train, y_train)
    yhat_train = xgb_model.predict(X_train)
    yhat_valid = xgb_model.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
n_best = n_estimators[np.argmin(rmse_valid)]
n_best
max_depth
Represents the maximum depth of the base learner trees.
rmse_train = []
rmse_valid = []
max_depth = [3, 4, 5, 6, 8, 10, 12, 15]
for d in max_depth:
    xgb_model = xgb.XGBRegressor(n_estimators=n_best, max_depth=d, n_jobs=-1, random_state=42)
    xgb_model.fit(X_train, y_train)
    yhat_train = xgb_model.predict(X_train)
    yhat_valid = xgb_model.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
d_best = max_depth[np.argmin(rmse_valid)]
d_best
learning_rate (=eta)
The learning rate defines the step size in each iteration of optimization (of the loss function).
--> prevents overfitting
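To make the "step size" concrete: each boosting iteration $m$ adds a new tree $h_m$ scaled by the learning rate $\eta$ (standard gradient boosting notation, added by us for reference):
$$ F_m(x) = F_{m-1}(x) + \eta \, h_m(x) $$
A smaller $\eta$ takes smaller steps, which usually requires more trees (n_estimators) but generalizes better.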
rmse_train = []
rmse_valid = []
learning_rate = [0.01, 0.4, 0.5, 0.6, 1]
for l in learning_rate:
    xgb_model = xgb.XGBRegressor(n_estimators=n_best, max_depth=d_best, learning_rate=l, n_jobs=-1, random_state=42)
    xgb_model.fit(X_train, y_train)
    yhat_train = xgb_model.predict(X_train)
    yhat_valid = xgb_model.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
l_best = learning_rate[np.argmin(rmse_valid)]
l_best
gamma
According to this forum, "Gamma causes shallower trees (or, at least, trees with fewer leaves), by restricting when splits will be made". The higher gamma is, the stronger the regularization.
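For reference (this is the split-gain formula from the XGBoost documentation, not from our lecture): a candidate split is only kept when its gain stays positive after subtracting $\gamma$, where $G$ and $H$ are the sums of gradients and hessians in the left/right child and $\lambda$ is the L2 regularization term:
$$ \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma $$
So the larger $\gamma$ is, the more splits get suppressed and the shallower the trees become.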
rmse_train = []
rmse_valid = []
gamma = [0, 0.1, 0.2, 0.3, 0.4, 1, 10, 20, 50, 100000]
for g in gamma:
    xgb_model = xgb.XGBRegressor(n_estimators=n_best, max_depth=d_best, learning_rate=l_best, n_jobs=-1, random_state=42, gamma=g)
    xgb_model.fit(X_train, y_train)
    yhat_train = xgb_model.predict(X_train)
    yhat_valid = xgb_model.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
g_best = gamma[np.argmin(rmse_valid)]
g_best
Note: I think gamma isn't optimized perfectly here, or maybe it wouldn't need any optimization at all. But to be honest, I will just continue with my g_best value.
min_child_weight
As far as I understand it, this number corresponds to the minimum number of instances needed in each node (similar to min_samples_leaf in a RandomForest). Strictly speaking it is the minimum sum of instance hessians in a child; for squared-error loss each instance has a hessian of 1, so it is indeed an instance count.
rmse_train = []
rmse_valid = []
min_child_weight = [0, 0.001, 0.01, 0.1, 0.3, 0.5, 1, 3]
for c in min_child_weight:
    xgb_model = xgb.XGBRegressor(n_estimators=n_best, max_depth=d_best, learning_rate=l_best, n_jobs=-1,
                                 random_state=42, gamma=g_best, min_child_weight=c)
    xgb_model.fit(X_train, y_train)
    yhat_train = xgb_model.predict(X_train)
    yhat_valid = xgb_model.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
c_best = min_child_weight[np.argmin(rmse_valid)]
c_best
booster
Which model type to use (gbtree creates a tree based model; gblinear creates a linear model).
rmse_train = []
rmse_valid = []
booster = ['gbtree', 'gblinear', 'dart']
for b in booster:
    xgb_model = xgb.XGBRegressor(n_estimators=n_best, max_depth=d_best, learning_rate=l_best, booster=b, n_jobs=-1,
                                 random_state=42, gamma=g_best, min_child_weight=c_best)
    xgb_model.fit(X_train, y_train)
    yhat_train = xgb_model.predict(X_train)
    yhat_valid = xgb_model.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
b_best = booster[np.argmin(rmse_valid)]
b_best
We've gone through some of the parameters. Let's check if our score has improved.
xgb_model = xgb.XGBRegressor(n_estimators=n_best, max_depth=d_best, learning_rate=l_best, booster=b_best, n_jobs=-1,
                             random_state=42, gamma=g_best, min_child_weight=c_best)
xgb_model.fit(X_train, y_train)
print_rf_scores(xgb_model)
#print out the model parameters for later
xgb_model
Note: I have improved the score for the XGB model, and I think there is way more potential in it. But to be completely honest, the mathematical details behind the parameters I've tuned are way too complicated for me to understand. ;) So I will be content with what I've done and continue...
How is my XGBoost performing?
As we did before (for the gradient booster), we want to look at how our XGBoost makes its predictions.
#let the model predict on the validation set
predictionsxgb = xgb_model.predict(X_valid)
# create a comparison DataFrame
compare_price_xgb = pd.DataFrame(columns=['Price', 'Predictions', 'Difference'])
#fill the columns
compare_price_xgb['Price'] = y_valid
compare_price_xgb['Predictions'] = predictionsxgb
compare_price_xgb['Difference'] = compare_price_xgb['Price'] - compare_price_xgb['Predictions']
#check if all worked fine
compare_price_xgb.head()
#print out a scatterplot with the real price and the difference to our predictions
sns.jointplot("Price", "Difference", data=compare_price_xgb,
kind="reg", truncate=False,
color="b", height=7)
compare_price_xgb["Difference"].describe()
We can clearly see that our regressor gets worse on higher-priced cars (the predictions for them are too low). Overall we can't find a general trend towards predicting too high or too low, but the predictions for low-priced cars are more likely too low, and vice versa.
Feature Importance
We want to know which features are the most important in our xgb regressor.
Note: This is just a very short analysis. Find more about model understanding and feature importance in chapter 10.
#create a df with the calculated feature importance for our model
feature_importance = pd.DataFrame(
    {"Column": X_train.columns, "Importance": xgb_model.feature_importances_}
).sort_values("Importance", ascending=False)
feature_importance
# print out a bar plot for the feature importance
sns.barplot(y="Column", x="Importance", data=feature_importance, color = 'b')
Findings:
- powerPS and kilometer are very important features.
- age is the most important feature in this model. In Manuel's gbr this feature has only a minor importance.

Like before, we try to improve our model by tuning each single parameter.
model = HistGradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
print_rf_scores(model)
Copyright: The following concept for analysing hyperparameters is copied from Lewis and Leandro's data science lectures.
rmse_train = []
rmse_valid = []
loss = ['least_squares', 'least_absolute_deviation']
for l in loss:
    hgb = HistGradientBoostingRegressor(loss=l, random_state=42)
    hgb.fit(X_train, y_train)
    yhat_train = hgb.predict(X_train)
    yhat_valid = hgb.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
l_best = loss[np.argmin(rmse_valid)]
l_best
rmse_train = []
rmse_valid = []
learning_rate = [0.001, 0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 1]
for lr in learning_rate:
    hgb = HistGradientBoostingRegressor(loss=l_best, learning_rate=lr, random_state=42)
    hgb.fit(X_train, y_train)
    yhat_train = hgb.predict(X_train)
    yhat_valid = hgb.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
lr_best = learning_rate[np.argmin(rmse_valid)]
lr_best
rmse_train = []
rmse_valid = []
max_iter = [5, 10, 20, 30, 50, 100, 500, 655, 800]
for i in max_iter:
    hgb = HistGradientBoostingRegressor(loss=l_best, learning_rate=lr_best, max_iter=i, random_state=42)
    hgb.fit(X_train, y_train)
    yhat_train = hgb.predict(X_train)
    yhat_valid = hgb.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
i_best = max_iter[np.argmin(rmse_valid)]
i_best
rmse_train = []
rmse_valid = []
max_leaf_nodes = [10, 31, 40, 100, 150, 200, 500]
for ln in max_leaf_nodes:
    hgb = HistGradientBoostingRegressor(loss=l_best, learning_rate=lr_best, max_iter=i_best, max_leaf_nodes=ln, random_state=42)
    hgb.fit(X_train, y_train)
    yhat_train = hgb.predict(X_train)
    yhat_valid = hgb.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
ln_best = max_leaf_nodes[np.argmin(rmse_valid)]
ln_best
rmse_train = []
rmse_valid = []
max_depth = [None, 5, 10, 20, 100, 200, 500]
for d in max_depth:
    hgb = HistGradientBoostingRegressor(loss=l_best, learning_rate=lr_best, max_iter=i_best, max_leaf_nodes=ln_best, max_depth=d, random_state=42)
    hgb.fit(X_train, y_train)
    yhat_train = hgb.predict(X_train)
    yhat_valid = hgb.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
d_best = max_depth[np.argmin(rmse_valid)]
d_best
Note: if no value is printed, d_best = None
rmse_train = []
rmse_valid = []
validation_fraction = [0.0000001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.5, 0.8]
for v in validation_fraction:
    hgb = HistGradientBoostingRegressor(loss=l_best, learning_rate=lr_best, max_iter=i_best, max_leaf_nodes=ln_best, max_depth=d_best, validation_fraction=v, random_state=42)
    hgb.fit(X_train, y_train)
    yhat_train = hgb.predict(X_train)
    yhat_valid = hgb.predict(X_valid)
    # append the scores to the lists
    rmse_train.append(rmse(y_train, yhat_train))
    rmse_valid.append(rmse(y_valid, yhat_valid))
v_best = validation_fraction[np.argmin(rmse_valid)]
v_best
hgb = HistGradientBoostingRegressor(loss=l_best, learning_rate=lr_best, max_iter=i_best, max_leaf_nodes=ln_best, max_depth=d_best, validation_fraction=v_best, random_state=42)
hgb.fit(X_train, y_train)
print_rf_scores(hgb)
#print out the model parameters for later
hgb
predictionshgb = hgb.predict(X_valid)
# create a comparison DataFrame
compare_price_hgb = pd.DataFrame(columns=['Price', 'Predictions', 'Difference'])
#fill the columns
compare_price_hgb['Price'] = y_valid
compare_price_hgb['Predictions'] = predictionshgb
compare_price_hgb['Difference'] = compare_price_hgb['Price'] - compare_price_hgb['Predictions']
#check if all worked fine
compare_price_hgb.head()
#print out a scatterplot with the real price and the difference to our predictions
sns.jointplot("Price", "Difference", data=compare_price_hgb,
kind="reg", truncate=False,
color="b", height=7)
To really outperform in the challenge, we will not simply pick the best model from above and make predictions with that. Crazy as we are, we will take all of these models and blend them together into a super model. The idea is adapted from Lavanya Shukla's notebook on Kaggle, but the code is all self-written. This method makes the final predictions more robust to overfitting.
At first we want to check whether the concept works and actually gives better results. We will try to blend the predictions of a RandomForest and a GradientBoosting model.
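In other words, the blended prediction is simply a weighted average of the two model predictions, with weights that sum to one:
$$ \hat{y}_{\text{blend}} = w_{\text{rf}}\,\hat{y}_{\text{rf}} + w_{\text{gbr}}\,\hat{y}_{\text{gbr}}, \qquad w_{\text{rf}}+w_{\text{gbr}}=1 $$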
For this stacked/blended model we use the RandomForestRegressor and the GradientBoostingRegressor with the best hyperparameters from chapters 1.8.2 Random Forest Regressor and 1.8.4 Gradient Boosting Regressor.
The first step is to fit both models on the train set. For this we copy the tuned models from above.
rf = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=16, max_features=0.5, max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=10, min_samples_leaf=1,
min_samples_split=10, min_weight_fraction_leaf=0.0,
n_estimators=300, n_jobs=-1, oob_score=False,
random_state=42, verbose=0, warm_start=False)
rf.fit(X_train, y_train)
print_rf_scores(rf)
gbr = GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
init=None, learning_rate=0.05, loss='ls',
max_depth=30, max_features='sqrt',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=1, min_samples_leaf=6,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=200, n_iter_no_change=None,
presort='deprecated', random_state=42, subsample=1.0,
tol=0.0001, validation_fraction=0.1, verbose=0,
warm_start=False)
gbr.fit(X_train, y_train)
print_rf_scores(gbr)
#import sys for the error handling
import sys
def blended_prediction(X, w_rf, w_gbr):
    """
    Makes predictions with multiple regressors (which need to be trained in advance)
    and combines them with a weighting.
    - Needs a feature matrix to predict on.
    - Returns an array of predictions.
    Note: It would be better to store the models in an array and feed them to the function.
    """
    #error handling for wrong weightings
    if (w_rf + w_gbr) != 1:
        sys.exit('Error: the weightings must add up to 1')
    predictions = ((w_rf * rf.predict(X)) +
                   (w_gbr * gbr.predict(X)))
    return predictions
def print_rmse_score_blended_model(w_rf=0.85, w_gbr=0.15):
    """
    Generates the RMSE scores of the blended model and prints them out.
    """
    yhat_train = blended_prediction(X_train, w_rf, w_gbr)
    yhat_valid = blended_prediction(X_valid, w_rf, w_gbr)
    scores = {
        "RMSE on train:": rmse(y_train, yhat_train),
        "RMSE on valid:": rmse(y_valid, yhat_valid)
    }
    for score_name, score_value in scores.items():
        print(score_name, round(score_value, 3))
#let the models predict on the validation set and print the score
print_rmse_score_blended_model(0.5, 0.5)
With a 50/50 ratio we get a score which is not better than both single models. But it is also not just the mean of both scores, which is a good sign!
Maybe the score gets better when we look for the best blending ratio. In the iteration below we check for the ratio which gives the best score on the validation set.
#let both model predict on the validation set
pred_rf = rf.predict(X_valid)
pred_gbr = gbr.predict(X_valid)
#initialize the list for the predictions
preds_list = []
#iterate over each possible ratio
for i in np.arange(0, 1, 0.02):
    yhat_valid = ((i * pred_rf) + ((1 - i) * pred_gbr))
    preds_list.append(rmse(y_valid, yhat_valid))
#search for the best combination in the lists
minimum = preds_list.index(min(preds_list))
best = np.arange(0,1,0.02)[minimum]
#print out the best score
print('best RF ratio: ' + str(best))
#plot out the rmse score for the different ratios (x axis = ratio of RF)
plt.plot(np.arange(0,1,0.02),preds_list)
This graph shows us the best proportion of the two regressors for the stacked/blended model. The best value given is the share of the RandomForestRegressor; the share of the GradientBoostingRegressor is therefore 1 - best, so that the model performs best.
#let's test our results from the graph
print_rmse_score_blended_model(0.12, 0.88)
Now we can see a slight improvement over the single models. So we stick to the plan and try to build a larger stacked model.
predictionssb = 0.12*rf.predict(X_valid)+0.88*gbr.predict(X_valid)
# create a comparison DataFrame
compare_price_sb = pd.DataFrame(columns=['Price', 'Predictions', 'Difference'])
#fill the columns
compare_price_sb['Price'] = y_valid
compare_price_sb['Predictions'] = predictionssb
compare_price_sb['Difference'] = compare_price_sb['Price'] - compare_price_sb['Predictions']
#check if all worked fine
compare_price_sb.head()
#print out a scatterplot with the real price and the difference to our predictions
sns.jointplot("Price", "Difference", data=compare_price_sb,
kind="reg", truncate=False,
color="b", height=7)
compare_price_sb["Difference"].describe()
Since we have tuned more than just the two models above, it would be a waste not to use that work. We want to blend all the models we've tuned in the previous chapters.
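For reference, scikit-learn also ships a VotingRegressor that does this kind of weighted averaging out of the box. A minimal sketch (our own example, assuming the tuned rf, gbr and hgb objects from this notebook; the weights are made-up placeholders, and we stick to our own grid search below):
#library alternative to the hand-rolled blending below
from sklearn.ensemble import VotingRegressor
blender = VotingRegressor(
    estimators=[('rf', rf), ('gbr', gbr), ('hgb', hgb)],
    weights=[0.2, 0.6, 0.2])  #hypothetical weights
blender.fit(X_train, y_train)  #note: refits every estimator internally
blend_preds = blender.predict(X_valid)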
#initialize the hgb model
hgb = HistGradientBoostingRegressor(l2_regularization=0.0, learning_rate=0.2,
loss='least_squares', max_bins=255,
max_depth=None, max_iter=655, max_leaf_nodes=500,
min_samples_leaf=20, n_iter_no_change=None,
random_state=42, scoring=None, tol=1e-07,
validation_fraction=0.8, verbose=0,
warm_start=False)
hgb.fit(X_train, y_train)
print_rf_scores(hgb)
xgb_model = xgb.XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=100000, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.5, max_delta_step=0, max_depth=8,
min_child_weight=0, missing=None, monotone_constraints='()',
n_estimators=28, n_jobs=-1, num_parallel_tree=1,
objective='reg:squarederror', random_state=42, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None, eval_metric='rmse')
xgb_model.fit(X_train, y_train)
print_rf_scores(xgb_model)
#let every regressor predict on the validation set
pred_rf = rf.predict(X_valid)
pred_gbr = gbr.predict(X_valid)
pred_hgb = hgb.predict(X_valid)
pred_xgb = xgb_model.predict(X_valid)
#initialize the variables
preds_list = []
best = 10000000
#iterate through all possible ratios
for n_rf in np.arange(0, 1, 0.02):
    for n_hgb in np.arange(0, 1, 0.02):
        for n_xgb in np.arange(0, 1, 0.02):
            #we don't need to calculate the score when the ratios add up to more than 1
            if n_rf + n_hgb + n_xgb > 1:
                break
            #calculate the ratio for gbr
            n_gbr = 1 - (n_rf + n_hgb + n_xgb)
            #blend the predictions into a single validation prediction
            yhat_valid = ((n_rf * pred_rf) +
                          (n_gbr * pred_gbr) +
                          (n_xgb * pred_xgb) +
                          (n_hgb * pred_hgb))
            #calculate the RMSE score
            score = rmse(y_valid, yhat_valid)
            #if the score is better than the best so far, save the ratios
            if score < best:
                best = score
                rf_best = n_rf
                gbr_best = n_gbr
                xgb_best = n_xgb
                hgb_best = n_hgb
#print out the results
print('best score = ' + str(best))
print('best value for rf = ' + str(rf_best))
print('best value for gbr = ' + str(gbr_best))
print('best value for xgb = ' + str(xgb_best))
print('best value for hgb = ' + str(hgb_best))
yhat_train = (rf_best*rf.predict(X_train)+ \
gbr_best*gbr.predict(X_train)+ \
xgb_best*xgb_model.predict(X_train)+ \
hgb_best*hgb.predict(X_train) \
)
yhat_valid = (rf_best*rf.predict(X_valid)+ \
gbr_best*gbr.predict(X_valid)+ \
xgb_best*xgb_model.predict(X_valid)+ \
hgb_best*hgb.predict(X_valid) \
)
scores = {
"RMSE on train:": rmse(y_train, yhat_train),
"RMSE on valid:": rmse(y_valid, yhat_valid)
}
for score_name, score_value in scores.items():
    print(score_name, round(score_value, 3))
predictionshb = rf_best*rf.predict(X_valid)+ \
gbr_best*gbr.predict(X_valid)+ \
xgb_best*xgb_model.predict(X_valid)+ \
hgb_best*hgb.predict(X_valid)
# create a comparison DataFrame
compare_price_hb = pd.DataFrame(columns=['Price', 'Predictions', 'Difference'])
#fill the columns
compare_price_hb['Price'] = y_valid
compare_price_hb['Predictions'] = predictionshb
compare_price_hb['Difference'] = compare_price_hb['Price'] - compare_price_hb['Predictions']
#check if all worked fine
compare_price_hb.head()
#print out a scatterplot with the real price and the difference to our predictions
sns.jointplot("Price", "Difference", data=compare_price_hb,
kind="reg", truncate=False,
color="b", height=7)
compare_price_hb["Difference"].describe()
We now want to compare every model we've trained and then select the best one to predict on our test set. It should be the huge blended model, but we want to be sure and prove it.
#first we collect all RMSE scores on validation data from all the tuned models
model_scores = [ ('RandomForest', 3768.534) ,
('GradientBoostingRegressor', 3555.837) ,
('HistGradientBoostingRegressor', 3854.463) ,
('XGBRegressor', 3883.759) ,
('Small Stacked Model (RF/GBR)', 3551.785) ,
('Huge Blended Model', 3514.15)
]
#Create a DataFrame
model_scores = pd.DataFrame(model_scores, columns = ['Model', 'RMSE Score'])
#plot out the RMSE scores to compare them
plt.title('Model Performance')
plt.xlabel('RMSE')
sns.set_color_codes("muted")
sns.barplot(x="RMSE Score", y="Model", data=model_scores)
As we can see the huge blended model performs the best!! So we will make our final predictions with this.
In most "real world cases" we don't want to just improve our kaggle score. In most cases we want to interpret our score and explain the results and the model behaviour to a customer. For this we will try to find out which features are important and which are not.
From our previous work we guess that the features age, powerPS, kilometer and notRepairedDamage have the biggest impact on the price. But let's figure it out.
Because we couldn't figure out how to adapt the following process to our stacked model, we will just analyze the feature importance for the RandomForest model.
model = RandomForestRegressor()
model.fit(X_train,y_train)
#create a function which returns a df with rf feature importance
def model_feature_importance(fitted_model, df):
    """
    Returns a DataFrame with the feature importance of a previously trained model.
    Args:
        fitted_model: a previously trained sklearn model
        df: pandas DataFrame with all columns for which we want to know the importance
    """
    return pd.DataFrame(
        {"Column": df.columns, "Importance": fitted_model.feature_importances_}
    ).sort_values("Importance", ascending=False)
What permutation-style feature importance does is shuffle the values in one column and make the predictions again; the importance is how much the model score changes when each feature's values are shuffled. Shuffling the values and then looking at the performance is better than just dropping columns, because by dropping columns the relationships between columns get lost. (Strictly speaking, the feature_importances_ attribute used above is the impurity-based variant; the shuffling idea is the permutation method, as in the sketch below.)
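As a small illustration of that shuffling idea, here is a minimal sketch (our own code, assuming the fitted model, the rmse() helper and X_valid / y_valid from this notebook):
import numpy as np
def shuffled_column_importance(fitted_model, X, y, column, n_repeats=5, seed=42):
    """Average increase in RMSE when one column's values are randomly shuffled."""
    rng = np.random.default_rng(seed)
    base_score = rmse(y, fitted_model.predict(X))
    increases = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        X_shuffled[column] = rng.permutation(X_shuffled[column].values)
        increases.append(rmse(y, fitted_model.predict(X_shuffled)) - base_score)
    return np.mean(increases)
#example: shuffled_column_importance(model, X_valid, y_valid, 'powerPS')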
# expected shape - (n_features, 2)
feature_importance = model_feature_importance(model, X_train)
# print out a bar plot for the feature importance
sns.barplot(y="Column", x="Importance", data=feature_importance, color = 'b')
We can see that our guess wasn't that bad: age, powerPS and kilometer are actually the three most important features.
We can see two drops in importance (between powerPS / kilometer and between monthOfRegistration / notRepairedDamage_ja). That the importance drops off at some point is actually a very common characteristic.
Let's see what we would lose if we dropped the features below the second threshold (monthOfRegistration).
#first print the score of the model with all columns
print_rf_scores(model)
#set a threshold for the feature importance
feature_importance_threshold = 0.01
#create a list with columns above the threshold
cols_to_keep = feature_importance[feature_importance['Importance'] > feature_importance_threshold]['Column']
#save the large feature matrices for later (use .copy() to get real copies, not references)
X_valid_copy = X_valid.copy()
X_train_copy = X_train.copy()
#create a new feature matrix for train and validation set
X_train_reduced = X_train[cols_to_keep]
X_valid_reduced = X_valid[cols_to_keep]
#initialize a simple rf model
model = RandomForestRegressor()
model.fit(X_train_reduced, y_train)
#print out the RMSE score
yhat_valid = model.predict(X_valid_reduced)
rmse(y_valid, yhat_valid)
Our RMSE score has worsened by about 130 points, but the model is much easier to understand.
Let's see how the model predicts when we build it with just 'age' and 'powerPS'.
#create a new feature matrix for train and validation set
X_train_reduced = X_train[['age', 'powerPS']]
X_valid_reduced = X_valid[['age', 'powerPS']]
#initialize a simple rf model
model = RandomForestRegressor()
model.fit(X_train_reduced, y_train)
#print out the RMSE score
yhat_valid = model.predict(X_valid_reduced)
rmse(y_valid, yhat_valid)
We want to compare this score with the simplest possible model (predicting the mean).
yhat_valid = pd.Series(y_train.mean())
yhat_valid = yhat_valid.repeat(y_valid.size)
rmse(y_valid, yhat_valid)
With just these two features we can predict with an RMSE of 4871, almost twice as good as just predicting the mean.
As we guessed at the very beginning of the notebook, the predictions are mainly based on just a few features. In a business case we would probably reduce our features to the most necessary ones so that we could better interpret and justify our results.
But since this is a Kaggle competition and we want the best possible score, we keep all our columns.
#load the test set
test = pd.read_csv(datapath/'test.csv')
# create a submission DataFrame
submission = pd.DataFrame(columns=['Id', 'Predicted'])
#make the Id column same as the test['carId'] column
submission['Id'] = test['carId']
#check if all worked fine
submission.head()
We want to prepare the test set the same way as the validation set. The only difference is that we apply one hot encoding directly, before changing the categories into numerical codes.
# create the age feature
test['dateCrawled'] = pd.to_datetime(test['dateCrawled'])
test['age'] = test['dateCrawled'].dt.year - test['yearOfRegistration']
test.drop(['yearOfRegistration'], axis = 1, inplace = True)
test['age']
#deal with unrealistic value
test.loc[test['age'] < 0, "age"]= 0
test.loc[test['age'] > 106, "age"]= np.nan
test.loc[test['powerPS'] > 1000,"powerPS"]= np.nan
test.isnull().sum()
#fill missing values with the same method we used in the train set
test['vehicleType'].fillna(test['vehicleType'].mode()[0], inplace=True)
test['gearbox'].fillna(test['gearbox'].mode()[0], inplace=True)
test['model'].fillna(test['model'].mode()[0], inplace=True)
test['fuelType'].fillna(test['fuelType'].mode()[0],inplace=True)
test['notRepairedDamage'].fillna(test['notRepairedDamage'].mode()[0], inplace=True)
test['age'].fillna(train['age'].median(), inplace=True)
test['powerPS'].fillna(train['powerPS'].median(), inplace=True)
#drop same columns we droped in the train set
test.drop(['seller'], axis=1, inplace = True)
test.drop(['carId'], axis=1, inplace = True)
test.drop(['offerType'], axis=1, inplace = True)
test.drop(['dateCrawled'], axis=1, inplace = True)
test.drop(['name'], axis=1, inplace = True)
test.drop(['dateCreated'], axis=1, inplace = True)
test.drop(['lastSeen'], axis=1, inplace = True)
test.drop(['nrOfPictures'], axis=1, inplace = True)
test.drop(['abtest'], axis=1, inplace = True)
#apply the function we used already for the validation set to convert strings into categories
convert_strings_to_categories_train(test)
def convert_categories_to_numeric(df):
    """
    A function to convert all columns of a dataframe with the Dtype 'category'
    into a numeric Dtype (based on the category codes).
    """
    for c in df:
        if df[c].dtype.name == 'category':
            df[c] = df[c].cat.codes + 1
#apply one hot encoding to the same columns like we did for the train and validation set
test = pd.get_dummies(test, columns=['gearbox','fuelType','notRepairedDamage'])
#apply the function we already used for the train set to make the remaining categorical columns numeric
convert_categories_to_numeric(test)
test.info()
#save processed data as .csv file
test.to_csv(datapath/'test_final.csv', index = False)
#load the latest version of the processed test set
test_processed = pd.read_csv(datapath/'test_final.csv')
display_large(test_processed.head(3))
#fill the submission frame with predictions (huge blended model)
submission['Predicted'] = (rf_best*rf.predict(test_processed)+ \
gbr_best*gbr.predict(test_processed)+ \
xgb_best*xgb_model.predict(test_processed)+ \
hgb_best*hgb.predict(test_processed) \
)
#save submission frame
submission.to_csv(datapath/'submission.csv', index = False)
Preparation of the datasets is an essential part, of course. There are thousands of methods and procedures to change the data before training a model.
We started by just checking all the data and quickly realized that there is a lot of nonsense data. Therefore, we had to change or even remove it to improve the score. But it took too long until we understood how much it helps to always check what happens to the model's RMSE score when changing just one thing. Otherwise, after changing several things, nobody knew what actually affected the model and its score. This cost us a lot of time!
Outliers and Missing Values
Further, outliers as well as missing values are not always bad for predictions; it depends on their distribution. Randomly distributed missing values or outliers are certainly not useful for a predictive model. But they can also be logically distributed, which is in fact information the model can use. And cutting away outliers which can also be found in the test set is not always useful.
One of the most important and simplest learnings was: a model cannot predict on data which is not represented in the train set!
It took us some weeks until we realized that, so we had to take some steps back, as explained in the notebook about the kNN method for filling up missing values.
Valid
As we worked on this project for about 5 weeks, we realized that it is a huge help to have a valid set which represents the Kaggle score as well as possible (because we applied the same feature engineering to the validation set as to the train set). Before that, we always had to upload a submission to check whether our changes were good or not. Therefore, we created the valid set at the very beginning, before changing anything on the train set, and then treated the validation set the same way as the test set.
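A minimal sketch of how such a valid set can be split off right at the start (our own example, assuming the raw train DataFrame from this notebook):
from sklearn.model_selection import train_test_split
train_part, valid_part = train_test_split(train, test_size=0.2, random_state=42)
Everything done to train_part afterwards (feature engineering, encoding) is then applied to valid_part in exactly the same way as to the test set.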
Organization
Right from the beginning, we created a OneDrive folder where all data, the notebook, the datasets, images and more are saved and available to everyone in the team. But we realized that OneDrive does not sync that fast. Therefore, changes made by one group member were not yet in the notebook of another member.
Furthermore, we messed up the clarity of the notebooks and sometimes lost the overview over them. We had several main notebooks and several "sub"-notebooks; this was a high risk.
In future work, we are going to create one main notebook. It is for adopting every improvement, testing out several things and, most importantly, having a clear structure with descriptions and so on. Additionally, we will create a second notebook, for example improve_score. This one is not as clearly structured as the main notebook, but it is useful for generating a better score. Within it, there are no plots and no code that does not directly influence the dataset or the RMSE score, so the whole improve_score notebook needs less time to run through. This will increase our efficiency.
Procedure
We are not that used to coding. Therefore, when something was working well, we were a little unstructured. As said before, we changed several things before checking again whether the score had improved or not. In the future, we are going to check the RMSE after every single change. We learned that when doing two changes and only then checking the RMSE, the score could be worse because of one change, although the other one actually improved the score. This decreased our efficiency massively.
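A small helper sketch for this one-change-at-a-time workflow (our own idea, assuming the rmse() helper and the train/valid split used throughout this notebook):
def check_change(model, label=""):
    """Fit the model and print train/valid RMSE, so every single change can be evaluated immediately."""
    model.fit(X_train, y_train)
    print(label,
          "| RMSE train:", round(rmse(y_train, model.predict(X_train)), 3),
          "| RMSE valid:", round(rmse(y_valid, model.predict(X_valid)), 3))
#example: check_change(RandomForestRegressor(random_state=42), "baseline")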